This paper investigates maximizers of the information divergence from an exponential family $\mathcal{E}$. It is shown that the $rI$-projection of a maximizer $P$ to $\overline{\mathcal{E}}$ is a convex combination of $P$ and a probability measure $P_-$ with disjoint support and the same value of the sufficient statistics $A$. This observation can be used to transform the original problem of maximizing $D(\cdot\|\mathcal{E})$ over the set of all probability measures into the maximization of a function $D_r$ over a convex subset of $\ker A$. The global maximizers of both problems correspond to each other. Furthermore, finding all local maximizers of $D_r$ yields all local maximizers of $D(\cdot\|\mathcal{E})$.

This paper also proposes two algorithms to find the maximizers of $D_r$ and applies them to two examples, where the maximizers of $D(\cdot\|\mathcal{E})$ were not known before.

1 Introduction



Let $\mathcal{X}$ be a finite set of cardinality $N$ and consider an exponential family $\mathcal{E}$ on $\mathcal{X}$. In this work this will mean that there exists a real-valued $h\times N$ matrix $A$ (whose columns $A_x$ are indexed by $x\in\mathcal{X}$) and a reference measure $r$ on $\mathcal{X}$ satisfying $r(x) > 0$ for all $x\in\mathcal{X}$ such that $\mathcal{E}$ consists of all probability measures on $\mathcal{X}$ of the form

$$P_\theta(x) = \frac{r(x)}{Z_\theta}\exp\Big(\sum_{i=1}^{h}\theta_i A_{i,x}\Big). \tag{1}$$

In this formula $\theta\in\mathbb{R}^h$ is a vector of parameters and $Z_\theta$ ensures normalization. The matrix $A$ is called the sufficient statistics of $\mathcal{E}$. For technical reasons it will be assumed that the row span of $A$ contains the constant vector $(1,\dots,1)$, see Section 2. The topological closure of $\mathcal{E}$ will be denoted by $\overline{\mathcal{E}}$.
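
The parametrization of the family can be made concrete with a few lines of Python. This is only a minimal sketch: the function name `exponential_family_member` and the small matrix `A` below are illustrative assumptions, not taken from the paper.

```python
import math

def exponential_family_member(theta, A, r):
    """Return P_theta as a list: P_theta(x) = r(x)/Z * exp(sum_i theta_i * A[i][x])."""
    n_states = len(r)
    weights = [
        r[x] * math.exp(sum(theta[i] * A[i][x] for i in range(len(theta))))
        for x in range(n_states)
    ]
    Z = sum(weights)  # normalization constant Z_theta
    return [w / Z for w in weights]

# Hypothetical example: three states, the constant row plus one further statistic.
A = [[1, 1, 1],
     [0, 1, 2]]
r = [1.0, 1.0, 1.0]
P = exponential_family_member([0.0, math.log(2.0)], A, r)
```

Changing the parameter that belongs to the constant row only rescales all weights and is absorbed by the normalization, which is why adding the constant row does not change the family as a set.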

The information divergence (also known as the Kullback-Leibler divergence or relative entropy) of two probability distributions $P$, $Q$ is defined as

$$D(P\|Q) = \sum_{x\in\mathcal{X}} P(x)\log\frac{P(x)}{Q(x)}. \tag{2}$$

Here we define $0\log 0 = 0\log(0/0) = 0$. The divergence is strictly positive unless $P = Q$, and it is infinite if the support of $P$ is not contained in the support of $Q$.

With these definitions Nihat Ay proposed the following problem, motivated by probabilistic models for evolution and learning in neural networks based on the infomax principle [2]:

• Given an exponential family $\mathcal{E}$, which probability measures $P$ maximize $D(P\|\mathcal{E})$? Here $D(P\|\mathcal{E}) = \inf_{Q\in\mathcal{E}} D(P\|Q)$.

Already [2] contains a lot of properties of the maximizers, like the projection property and support restrictions, but only in the case where the $rI$-projection $P_{\mathcal{E}}$ of the maximizer lies in $\mathcal{E}$. The projection property means that the maximizer $P$ satisfies $P(x) = P_{\mathcal{E}}(x)/P_{\mathcal{E}}(Z)$ for all $x\in Z := \mathrm{supp}(P)$. In [13] František Matúš computed the first order optimality conditions in the general case, showing that the projection property also holds if $P_{\mathcal{E}}\in\overline{\mathcal{E}}\setminus\mathcal{E}$. For further results on the maximization problem see [3], [14], [15].

In this work it is shown that the original maximization problem can be solved by studying the related problem:

• Maximize the function $D_r(u) = \sum_{x\in\mathcal{X}} u(x)\log\frac{|u(x)|}{r(x)}$ over all $u\in\ker A$ such that $\|u\|_1\le 2$ and $\sum_x u(x) = 0$.

Here, $\|u\|_1$ is the $\ell_1$-norm of $u$. Theorem 3 will show that there is a bijection between the global maximizers of these two maximization problems. Furthermore, knowing all local maximizers of $D_r$ yields all local maximizers of $D(\cdot\|\mathcal{E})$. This relation is a consequence of the projection property mentioned above.

In Section 2 some known properties of exponential families and the information divergence are collected, including Matúš's result on the first order optimality conditions of maximizers of $D(\cdot\|\mathcal{E})$. In Section 3 the projection property is analyzed. It is easy to see that probability measures that satisfy the projection property and that do not belong to $\overline{\mathcal{E}}$ come in pairs $(P_+, P_-)$ such that $P_+ - P_-\in\ker A\setminus\{0\}$. This pairing is used in Section 4 to replace the original problem by the maximization of the function $D_r$. Theorem 3 in this section investigates the relation between the maximizers of both problems. In Section 5 the first order conditions of $D_r$ are computed. Section 6 discusses the case where $\dim\ker A = 1$, demonstrating how the reformulation leads to a quick solution of the original problem. Section 7 gives some ideas on how to solve the critical equations from Section 5. Section 8 presents an alternative method of computing the local maximizers of $D(\cdot\|\mathcal{E})$, which uses the projection property more directly. Sections 7 and 8 contain two examples which demonstrate how the theory of this paper can be put to practical use.



2 Exponential families and the information divergence 



The definition of an exponential family, as it will be used in this work, was already stated in the introduction. It is important to note that the correspondence between exponential families $\mathcal{E}$ on one side and sufficient statistics $A$ and reference measure $r$ on the other side is not unique. One reason for this lies in the normalization of probability measures: We can always add a constant row to the matrix $A$ without changing $\mathcal{E}$ (as a set). For this reason in the following it will be assumed that $A$ contains the constant row $(1,\dots,1)$ in its row space. This implies that every $u\in\ker A$ satisfies $\sum_{x\in\mathcal{X}} u(x) = 0$.

In order to characterize the remaining ambiguity in the parametrization $(r, A)\mapsto\mathcal{E}$, denote by $\mathcal{E}_{r,A}$ the exponential family associated to a given matrix $A$ and a given reference measure $r$. Then $\mathcal{E}_{r,A} = \mathcal{E}_{r',A'}$ as sets if and only if the following two conditions are satisfied:

• $r\in\mathcal{E}_{r',A'}$.

• The row span of $A$ equals the row span of $A'$.

The introduction also featured the definition of the information divergence. In the following we will also use formula (2) for positive measures $Q$ which are not necessarily normalized. In this case

$$D(P\|\lambda Q) = \sum_{x\in\mathcal{X}} P(x)\log\frac{P(x)}{\lambda Q(x)} = D(P\|Q) - \log\lambda \quad\text{for all } \lambda > 0, \tag{3}$$

where $\sum_x P(x) = 1$ was used.

The following theorem sums up the main facts about exponential families:

Theorem A. Let $P$ be a probability measure on $\mathcal{X}$. Then there exists a unique probability measure $P_{\mathcal{E}}$ in $\overline{\mathcal{E}}$ such that $AP = AP_{\mathcal{E}}$. Furthermore, $P_{\mathcal{E}}$ has the following properties:

1. For all $Q\in\mathcal{E}$

$$D(P\|Q) = D(P\|P_{\mathcal{E}}) + D(P_{\mathcal{E}}\|Q). \tag{4}$$

2. $P_{\mathcal{E}}$ satisfies

$$D(P\|\mathcal{E}) = D(P\|P_{\mathcal{E}}) = H_r(P_{\mathcal{E}}) - H_r(P), \tag{5}$$

where $H_r$ is the function defined in (6).

3. $P_{\mathcal{E}}$ maximizes the concave function

$$H_r(Q) = -\sum_{x\in\mathcal{X}} Q(x)\log\frac{Q(x)}{r(x)} \tag{6}$$

subject to the condition $AQ = AP$.

Sketch of proof. Corollary 3.1 of [7] proves existence and uniqueness of $P_{\mathcal{E}}$ and the "Pythagorean identity" (4) for all probability measures $P$ and all probability measures $Q\in\mathcal{E}$. It follows from (3) that

$$D(P\|r) = D(P\|P_{\mathcal{E}}) + D(P_{\mathcal{E}}\|r), \tag{7}$$

so statements 2. and 3. follow from $H_r(Q) = -D(Q\|r)$. □

$P_{\mathcal{E}}$ is called the $rI$-projection of $P$ to $\mathcal{E}$, or simply the projection of $P$ to $\mathcal{E}$.

Note that the function $H_r$ introduced in the theorem satisfies $H_r(P) = -D(P\|r)$. It can thus be interpreted as a negative relative entropy. In this work $H_r$ is preferred to its negative counterpart in order to keep the connection to the entropy $H$ visible in the important case that $r(x) = 1$ for all $x\in\mathcal{X}$.
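
As a quick numerical illustration, the sketch below implements the divergence with the convention $0\log 0 = 0$ for a possibly unnormalized second argument and defines $H_r$ as its negative with respect to $r$. The helper names `kl` and `H_r` and the test vectors are assumptions made for this example only. The variable `scaled` shows the effect of rescaling the second argument by a constant $\lambda > 0$, which shifts the divergence by $-\log\lambda$ whenever $\sum_x P(x) = 1$.

```python
import math

def kl(P, Q):
    """D(P||Q) with 0 log 0 = 0; Q may be an unnormalized nonnegative measure."""
    total = 0.0
    for p, q in zip(P, Q):
        if p > 0:
            if q == 0:
                return math.inf  # support of P not contained in support of Q
            total += p * math.log(p / q)
    return total

def H_r(P, r):
    """H_r(P) = -D(P||r); for r = 1 this is the Shannon entropy of P."""
    return -kl(P, r)

P = [0.5, 0.25, 0.25, 0.0]
Q = [0.25, 0.25, 0.25, 0.25]
r = [1.0, 1.0, 1.0, 1.0]
lam = 3.0
scaled = kl(P, [lam * q for q in Q])  # equals kl(P, Q) - log(lam)
```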

The map associated to the matrix $A$ is called the moment map. It maps the set of all probability measures on $\mathcal{X}$ onto the polytope $\mathcal{M}$ which is the convex hull of the columns of $A$. This polytope is called the convex support of $\mathcal{E}$. In the special case that $\mathcal{E}$ is a hierarchical model (see [12]), $\mathcal{M}$ is called the marginal polytope of $\mathcal{E}$.

Note that we can associate a point $A_x\in\mathcal{M}$ with each state $x\in\mathcal{X}$. Among these points are the vertices of $\mathcal{M}$, but not every point $A_x$ needs to be a vertex of $\mathcal{M}$.

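Theorem B. Let $P_+$ be a (local) maximizer of $D(\cdot\|\mathcal{E})$ with support $Z = \mathrm{supp}(P_+)$ and $P_{\mathcal{E}}$ its $rI$-projection to $\mathcal{E}$. Then the following holds:

1. $P_+$ satisfies the projection property, i.e., up to normalization $P_+$ equals the restriction of $P_{\mathcal{E}}$ to $Z$:

$$P_+(x) = \begin{cases} \dfrac{P_{\mathcal{E}}(x)}{P_{\mathcal{E}}(Z)}, & \text{if } x\in Z,\\ 0, & \text{else.}\end{cases} \tag{8}$$

2. Suppose $\mathcal{Y} := \mathrm{supp}(P_{\mathcal{E}})\neq\mathcal{X}$. Then the moment map maps $\mathcal{Y}$ and $\mathcal{X}\setminus\mathcal{Y}$ into parallel hyperplanes.

3. The cardinality of $Z$ is bounded by $\dim\mathcal{E} + 1$.

Proof. Statements 1. and 3. were already known to Ay [2] in the special case where $\mathcal{Y} = \mathcal{X}$. The general form of statement 3. is Proposition 3.2 of [15]. Statement 2. and the generalization of statement 1. are due to Matúš [13, Theorem 5.1]. □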

The paper [13] contains further conditions on the maximizer. However, these will not be studied in this work.

Definition 1. Any probability measure $P$ that satisfies (8) will be called a projection point. If $P$ satisfies conditions 1. and 2. of Theorem B then $P$ will be called a quasi-critical point of $D(\cdot\|\mathcal{E})$, or a $D$-quasi-critical point.

1 In convex analysis, a point satisfying all first-order conditions (which in general comprise both equations and inequalities) of a convex function is called a critical point. In analogy to this, the term "quasi-critical" point is chosen in this work for a point which satisfies only the equations derived from the first order conditions of an arbitrary function.
3 Projection points



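In this section assume that $A$ does not have full rank. Otherwise the function $D(\cdot\|\mathcal{E})$ is trivial.

Let $P_+$ be a projection point, and let $P_{\mathcal{E}}$ be its projection to $\overline{\mathcal{E}}$. Denote $Z = \mathrm{supp}(P_+)$ and $\mathcal{Y} = \mathrm{supp}(P_{\mathcal{E}})$. Every measure $P_\lambda := \lambda P_+ + (1-\lambda)P_{\mathcal{E}}$ on the line through $P_+$ and $P_{\mathcal{E}}$ is normalized and has the same sufficient statistics as $P_+$ and $P_{\mathcal{E}}$. Fix $\lambda_- = -\frac{P_{\mathcal{E}}(Z)}{1 - P_{\mathcal{E}}(Z)}$. Then

$$P_{\lambda_-}(x) = \begin{cases} -\dfrac{P_{\mathcal{E}}(Z)}{1-P_{\mathcal{E}}(Z)}\dfrac{P_{\mathcal{E}}(x)}{P_{\mathcal{E}}(Z)} + \dfrac{1}{1-P_{\mathcal{E}}(Z)}P_{\mathcal{E}}(x) = 0, & \text{if } x\in Z,\\[2mm] (1-\lambda_-)P_{\mathcal{E}}(x) = \dfrac{1}{1-P_{\mathcal{E}}(Z)}P_{\mathcal{E}}(x)\ge 0, & \text{else.}\end{cases} \tag{9}$$

Thus $P_- := P_{\lambda_-}$ is a probability measure with support equal to $\mathcal{Y}\setminus Z$, and $u := P_+ - P_-$ lies in the kernel of $A$. Furthermore, $P_-$ is a second projection point with the same projection $P_{\mathcal{E}}$ to $\overline{\mathcal{E}}$ as $P_+$.

The projection $P_{\mathcal{E}}$ can be written as a convex combination of $P_+$ and $P_-$, i.e., $P_{\mathcal{E}} = \mu P_+ + (1-\mu)P_-$, where $\mu = \frac{\lambda_-}{\lambda_- - 1}\in(0,1)$. Since the supports of $P_+$ and $P_-$ are disjoint we have $\mu = P_{\mathcal{E}}(Z)$ and $(1-\mu) = P_{\mathcal{E}}(\mathcal{X}\setminus Z)$. In other words,

$$P_{\mathcal{E}}(x) = \begin{cases}\mu P_+(x), & \text{if } x\in Z,\\ (1-\mu)P_-(x), & \text{else.}\end{cases} \tag{10}$$

There are a lot of relations between $P_+$, $P_-$ and $P_{\mathcal{E}}$. They will be collected in the following Lemma in a slightly more general form.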

Lemma 2. Let $P_+$ and $P_-$ be two probability measures with disjoint supports such that $AP_+ = AP_-$. Let $\hat P$ be the unique probability measure in the convex hull of $P_+$ and $P_-$ that maximizes the function

$$H_r(Q) = -\sum_{x\in\mathcal{X}} Q(x)\log\frac{Q(x)}{r(x)}. \tag{11}$$

Define $\mu = \hat P(Z)$, where $Z = \mathrm{supp}(P_+)$. Then the following equations hold:

$$\exp(H_r(\hat P)) = \exp(H_r(P_+)) + \exp(H_r(P_-)), \tag{12a}$$

$$\frac{\mu}{1-\mu} = \exp(H_r(P_+) - H_r(P_-)), \tag{12b}$$

$$D(P_+\|\hat P) = H_r(\hat P) - H_r(P_+) = \log(1 + \exp(H_r(P_-) - H_r(P_+))). \tag{12c}$$

Proof. The first observation is

$$H_r(\hat P) = \mu H_r(P_+) + (1-\mu)H_r(P_-) + h(\mu, 1-\mu), \tag{13}$$

where $h(\mu, 1-\mu) = -\mu\log(\mu) - (1-\mu)\log(1-\mu)$.

Since $\hat P$ maximizes $H_r$ among all convex combinations of $P_+$ and $P_-$, it follows that

$$\frac{\partial}{\partial\mu'}\big(\mu' H_r(P_+) + (1-\mu')H_r(P_-) + h(\mu', 1-\mu')\big) = H_r(P_+) - H_r(P_-) + \log(1-\mu') - \log(\mu')$$

must vanish at $\mu' = \mu$, which rewrites to

$$\frac{\mu}{1-\mu} = \exp(H_r(P_+) - H_r(P_-)), \tag{14}$$

or

$$\mu = \frac{\exp(H_r(P_+))}{\exp(H_r(P_+)) + \exp(H_r(P_-))} = \frac{1}{1 + \exp(H_r(P_-) - H_r(P_+))}. \tag{15}$$

This implies

$$\begin{aligned} h(\mu, 1-\mu) &= -\mu H_r(P_+) + \mu\log\big(\exp(H_r(P_+)) + \exp(H_r(P_-))\big)\\ &\quad - (1-\mu)H_r(P_-) + (1-\mu)\log\big(\exp(H_r(P_+)) + \exp(H_r(P_-))\big)\\ &= -\mu H_r(P_+) - (1-\mu)H_r(P_-) + \log\big(\exp(H_r(P_+)) + \exp(H_r(P_-))\big). \end{aligned}$$

Comparison with equation (13) yields

$$\exp(H_r(\hat P)) = \exp(H_r(P_+)) + \exp(H_r(P_-)), \tag{16}$$

which in turn simplifies (15) to

$$\mu = \exp(H_r(P_+) - H_r(\hat P)). \tag{17}$$

The Kullback-Leibler divergence equals

$$\begin{aligned} D(P_+\|\hat P) &= \sum_{x\in Z} P_+(x)\log\frac{1}{\mu} = -\log(\mu) &\text{(18a)}\\ &= H_r(\hat P) - H_r(P_+) &\text{(18b)}\\ &= \log(1 + \exp(H_r(P_-) - H_r(P_+))). &\text{(18c)} \end{aligned}$$

□

As an easy consequence, if $P_+$ and $P_-$ are projection points with common projection $P_{\mathcal{E}}$ as constructed above, then

$$\exp(-D(P_+\|\mathcal{E})) + \exp(-D(P_-\|\mathcal{E})) = 1, \tag{19}$$

from which we see that in general $P_+$ and $P_-$ will not both be maximizers of $D(\cdot\|\mathcal{E})$. Furthermore it follows that $D(P\|\mathcal{E})\ge\log(2)$ for any global maximizer $P$ (assuming that $A$ does not have full rank).
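
The identities (12a) and (12c) are easy to check numerically. The following sketch uses made-up measures `P_plus`, `P_minus` and a made-up reference measure `r`; since the identities only use the disjointness of the supports, no matrix $A$ appears. The weight `mu` is taken from the closed form derived in the proof and compared against a brute-force search along the segment.

```python
import math

def H_r(P, r):
    """H_r(P) = -sum_x P(x) log(P(x)/r(x)), with 0 log 0 = 0."""
    return -sum(p * math.log(p / r[x]) for x, p in enumerate(P) if p > 0)

def mix(m, P, Q):
    """Convex combination m*P + (1-m)*Q."""
    return [m * p + (1 - m) * q for p, q in zip(P, Q)]

# Illustrative data (assumptions for this example only).
r = [1.0, 2.0, 1.0, 1.0]
P_plus = [0.7, 0.3, 0.0, 0.0]
P_minus = [0.0, 0.0, 0.4, 0.6]

h_plus, h_minus = H_r(P_plus, r), H_r(P_minus, r)
mu = 1.0 / (1.0 + math.exp(h_minus - h_plus))   # closed form for the optimal weight
P_hat = mix(mu, P_plus, P_minus)

# Brute-force maximization of H_r over a grid of mixing weights.
grid = [k / 1000 for k in range(1, 1000)]
best = max(grid, key=lambda m: H_r(mix(m, P_plus, P_minus), r))

# D(P_plus || P_hat), computed directly from the definition.
div = sum(p * math.log(p / P_hat[x]) for x, p in enumerate(P_plus) if p > 0)
```

Here `div` agrees with both `H_r(P_hat, r) - h_plus` and `log(1 + exp(h_minus - h_plus))` up to rounding, and `best` lies within one grid step of `mu`.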
4 Decomposition of Kernel Elements



Now suppose that $u$ is an arbitrary nonzero element from the kernel of $A$. Then $u = u_+ - u_-$, where $u_+$ and $u_-$ are nonnegative vectors of disjoint support. Since $A$ contains the constant vector $(1,\dots,1)$ in its row span, it follows that the $\ell_1$-norms of $u_+$ and $u_-$ are equal. Thus $u = d_u(P_+ - P_-)$, where $d_u = \|u_+\|_1 = \|u_-\|_1 = \frac12\|u\|_1 > 0$ is called the degree of $u$ and $P_+$ and $P_-$ are two probability measures with disjoint supports. Since $P_+$ and $P_-$ have the same image under $A$, they have the same projection to $\overline{\mathcal{E}}$, which will be denoted by $P_{\mathcal{E}}$.

Let $\hat P$ be the convex combination of $P_+$ and $P_-$ that maximizes $H_r$. Note that in general $\hat P\neq P_{\mathcal{E}}$. Still Lemma 2 applies. Furthermore

$$D(P_+\|\mathcal{E}) = H_r(P_{\mathcal{E}}) - H_r(P_+) \ge D(P_+\|\hat P) = H_r(\hat P) - H_r(P_+), \tag{20}$$

since $P_{\mathcal{E}}$ maximizes $H_r$ when the image under $A$ is constrained (see Theorem A).

These facts can be used to relate two different optimization problems. The first one is the maximization of the information divergence from $\mathcal{E}$. The second one is the maximization of the function

$$D_r : \ker A\to\mathbb{R},\quad u\mapsto\sum_{x\in\mathcal{X}} u(x)\log\frac{|u(x)|}{r(x)} \tag{21}$$

subject to the constraint $d_u = \frac12\|u\|_1 = 1$. From what has been said above, if $d_u = 1$ then $u = Q_+ - Q_-$ for two probability measures $Q_+, Q_-$ with disjoint support, and in this case

$$D_r(u) = H_r(Q_-) - H_r(Q_+). \tag{22}$$

Since $D_r$ is a continuous function on the compact $\ell_1$-sphere of radius 2 in $\ker A$, a maximum is guaranteed to exist.
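
The decomposition of a kernel element and the evaluation of $D_r$ can be sketched as follows. The names `decompose`, `D_r` and `H_r` are illustrative, and the vector `u` is an arbitrary example whose entries sum to zero; it is not tied to a specific matrix $A$.

```python
import math

def decompose(u):
    """Split u (entries summing to zero) into its degree and the measures P_plus, P_minus."""
    d_u = sum(x for x in u if x > 0)          # equals half the l1-norm of u
    if d_u == 0:
        raise ValueError("u must be nonzero")
    P_plus = [max(x, 0.0) / d_u for x in u]
    P_minus = [max(-x, 0.0) / d_u for x in u]
    return d_u, P_plus, P_minus

def D_r(u, r):
    """D_r(u) = sum_x u(x) log(|u(x)|/r(x)), with 0 log 0 = 0."""
    return sum(ux * math.log(abs(ux) / r[x]) for x, ux in enumerate(u) if ux != 0)

def H_r(P, r):
    return -sum(p * math.log(p / r[x]) for x, p in enumerate(P) if p > 0)

u = [0.5, 0.5, -0.75, -0.25]
r = [1.0, 1.0, 1.0, 1.0]
d_u, Q_plus, Q_minus = decompose(u)   # d_u == 1 for this u
```

For this `u` the value `D_r(u, r)` coincides with `H_r(Q_minus, r) - H_r(Q_plus, r)`, so maximizing $D_r$ favors a low-entropy positive part paired with a high-entropy negative part.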

Theorem 3. Let $\mathcal{E}$ be an exponential family with sufficient statistics $A$.

1. If $u = Q_+ - Q_-\in\ker A\setminus\{0\}$ is a global maximizer of $D_r$ subject to $d_u = \frac12\|u\|_1 = 1$, then the positive part $Q_+$ of $u$ globally maximizes $D(\cdot\|\mathcal{E})$.

2. Let $P_+$ be a local maximizer of the information divergence. There exists a unique probability measure $P_-$ with support disjoint from $P_+$ such that $P_+ - P_-\in\ker A$ is a local maximizer of $D_r$. If $P_+$ is a global maximizer, then $P_+ - P_-$ is a global maximizer.

Proof. (1) Consider global maximizers first:

Choose probability measures $Q_+$ and $Q_-$ of disjoint support such that $u = Q_+ - Q_-$ maximizes $D_r$. Denote by $\hat Q$ the probability measure from the convex hull of $Q_+$ and $Q_-$ that maximizes $H_r$. In addition, let $P_+$ be a global maximizer of $D(\cdot\|\mathcal{E})$. Construct $P_-$ as in Section 3. From (12c) and (20) it follows that

$$\begin{aligned}\log(1 + \exp(H_r(P_-) - H_r(P_+))) &= D(P_+\|\mathcal{E}) \ge D(Q_+\|\mathcal{E})\\ &\ge D(Q_+\|\hat Q) = H_r(\hat Q) - H_r(Q_+)\\ &= \log(1 + \exp(H_r(Q_-) - H_r(Q_+))).\end{aligned} \tag{23}$$

The maximality property of $Q_+ - Q_-$ implies that all terms of (23) are equal. This proves the global part of the theorem.

(2) Now suppose that $P_+$ is a local maximizer of the information divergence and define $P_-$ as above. Choose a neighbourhood $U$ of $P_+$ such that $D(P'\|\mathcal{E})\le D(P_+\|\mathcal{E})$ for all $P'\in U$. Since the map $u\mapsto(u_+, u_-)$ is continuous, there is a neighbourhood $U'$ of $P_+ - P_-$ such that $Q'_+\in U$ for all probability measures $Q'_+, Q'_-$ of disjoint support with $Q'_+ - Q'_-\in U'$. It follows that

$$\begin{aligned}\log(1 + \exp(H_r(P_-) - H_r(P_+))) &= D(P_+\|\mathcal{E}) \ge D(Q'_+\|\mathcal{E})\\ &\ge H_r(\hat Q') - H_r(Q'_+) = \log(1 + \exp(H_r(Q'_-) - H_r(Q'_+)))\end{aligned}$$

for all $Q'_+ - Q'_-$ from the neighbourhood $U'$ of $P_+ - P_-$. Thus $P_+ - P_-$ is a local maximizer.

$P_-$ is unique since it is characterized as the unique maximizer of the concave function $H_r$ under the linear constraints $P_+ - P_-\in\ker A$ and $\mathrm{supp}(P_+)\cap\mathrm{supp}(P_-) = \emptyset$. □

Remark 4. There are several possibilities to reformulate the problem of maximizing $D_r$. To see this, note that $D_r$ is homogeneous of degree one, since

$$D_r(\alpha u) = \alpha\sum_{x} u(x)\log\frac{|u(x)|}{r(x)} + \alpha\sum_{x} u(x)\log|\alpha| = \alpha D_r(u) \tag{24}$$

for all $u\in\ker A$ and $\alpha\in\mathbb{R}$, where $\sum_x u(x) = 0$ was used. This means that, when maximizing $D_r$, the constraint $d_u = 1$ is equivalent to $d_u\le 1$. Under the inequality constraint the maximization is over a polytope, while under the equality constraint the maximization is over the boundary of the same polytope.

A third alternative is the maximization of the function

$$D_r^1 : \ker A\setminus\{0\}\to\mathbb{R},\quad u\mapsto\frac{1}{d_u}D_r(u). \tag{25}$$

The solutions of this last problem need to be normalized in order to compare this maximization problem with the other formulations.

Remark 5. It is an open question when the projection $P_{\mathcal{E}}$ of a maximizer $P_+$ lies in the interior of the probability simplex. More generally one could ask for the support of $P_{\mathcal{E}}$. Since $\mathrm{supp}(P_{\mathcal{E}}) = \mathrm{supp}(P_+ - P_-)$ this question can also be studied with the help of the theorem.

In many cases the support of $P_{\mathcal{E}}$ will be all of $\mathcal{X}$. However, the construction of Example 10 shows that $P_{\mathcal{E}}$ can have any support (of cardinality at least two). See also [15].

5 First order conditions

Theorem 3 implies that all maximizers of $D(\cdot\|\mathcal{E})$ are known once all maximizers of $D_r$ are found. The latter can be computed by solving the first order conditions. To simplify the notation define

$$u(B) := \sum_{x\in B} u(x) \tag{26}$$

if $u\in\mathbb{R}^{\mathcal{X}}$ is any vector and $B\subseteq\mathcal{X}$.

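Proposition 6. Let $u\in\ker A$ be a local maximizer of $D_r$ subject to $d_u = 1$. The following statements hold:

1. $v(u = 0) := \sum_{x : u(x) = 0} v(x) = 0$ for all $v\in\ker A$.

2. $u$ satisfies

$$\sum_{x : u(x)\neq 0} v(x)\log\frac{|u(x)|}{r(x)} + \sum_{x : u(x) = 0} v(x)\log\frac{|v(x)|}{r(x)} \le d'_u(v)D_r(u) \tag{27}$$

for all $v\in\ker A$, where $d'_u(v) := v(u > 0) + v_+(u = 0)$.

3. If $v\in\ker A$ satisfies $\mathrm{supp}(v)\subseteq\mathrm{supp}(u)$, then

$$\sum_{x} v(x)\log\frac{|u(x)|}{r(x)} = d'_u(v)D_r(u). \tag{28}$$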

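Proof. First note that the degree $d_v = \sum_x v_+(x) = \sum_x v_-(x) = \frac12\|v\|_1$ is piecewise linear in the following sense:

• Let $u, v\in\ker A$. Then there exists $\lambda_1 > 0$ such that

$$d_{u+\lambda v} = d_u + \lambda d'_u(v)\quad\text{for all } 0\le\lambda\le\lambda_1, \tag{29}$$

where $d'_u(v) = \sum_{x : u > 0} v(x) + \sum_{x : u = 0} v_+(x) = v(u > 0) + v_+(u = 0)\in\mathbb{R}$ depends only on $u$ and $v$ (but not on $\lambda$).

Fix $u, v\in\ker A$. If $\epsilon > 0$ is small enough then

$$\begin{aligned} D_r(u + \epsilon v) &= \sum_{x : u\neq 0} u(x)\log\frac{|u(x)|}{r(x)} + \sum_{x : u\neq 0} u(x)\log\Big(1 + \epsilon\frac{v(x)}{u(x)}\Big)\\ &\quad + \epsilon\sum_{x : u\neq 0} v(x)\log\frac{|u(x) + \epsilon v(x)|}{r(x)} + \epsilon\sum_{x : u = 0} v(x)\log\frac{|\epsilon v(x)|}{r(x)}\\ &= D_r(u) + \epsilon\Big(\sum_{x : u\neq 0} v(x)\log\frac{|u(x)|}{r(x)} + \sum_{x : u = 0} v(x)\log\frac{|v(x)|}{r(x)}\Big)\\ &\quad + \epsilon\log|\epsilon|\,v(u = 0) + \epsilon\,v(u\neq 0) + o(\epsilon), \end{aligned}$$

where $\log(1 + \epsilon x) = \epsilon x + o(\epsilon)$ was used.

Using (29) and (25) yields

$$\begin{aligned} D_r^1(u + \epsilon v) &= D_r^1(u) - \epsilon\frac{d'_u(v)}{d_u}D_r^1(u) + \frac{\epsilon}{d_u}\Big(\sum_{x : u\neq 0} v(x)\log\frac{|u(x)|}{r(x)} + \sum_{x : u = 0} v(x)\log\frac{|v(x)|}{r(x)}\Big)\\ &\quad + \frac{1}{d_u}\epsilon\log|\epsilon|\,v(u = 0) + \frac{\epsilon}{d_u}v(u\neq 0) + o(\epsilon)\\ &= D_r^1(u) + \frac{\epsilon}{d_u}\Big(\sum_{x : u\neq 0} v(x)\log\frac{|u(x)|}{r(x)} + \sum_{x : u = 0} v(x)\log\frac{|v(x)|}{r(x)} - d'_u(v)D_r^1(u)\Big)\\ &\quad + \frac{1}{d_u}\epsilon\log|\epsilon|\,v(u = 0) + \frac{\epsilon}{d_u}v(u\neq 0) + o(\epsilon). \end{aligned} \tag{30}$$

Now let $u$ be a local maximizer of $D_r$ in $\ker A$ subject to $d_u = 1$. Then it is also a local maximizer of $D_r^1$ by Remark 4. Therefore the first statement follows from the facts that the derivative of $\epsilon\log\epsilon$ diverges at zero and the coefficient $\frac{1}{d_u}v(u = 0)$ changes its sign if $v$ is replaced by $-v$. Since $v(u\neq 0) = v(\mathcal{X}) - v(u = 0) = 0$ the inequality follows for all $v\in\ker A$. If $\mathrm{supp}(v)\subseteq\mathrm{supp}(u)$ then $d'_u(-v) = -v(u > 0) = -d'_u(v)$. In this case both sides of the inequality change their sign when $v$ is replaced by $-v$, thus it holds as an equality. □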

Definition 7. A point $u\in\ker A$ is called a quasi-critical point of $D_r$ if it satisfies the conditions 1 and 3 of Proposition 6.

The importance of this definition is that every local extremum of $D_r$ is also a quasi-critical point by the above proposition. This means that any convergent numerical optimisation algorithm will at least find a quasi-critical point.

Remark 8. Condition 1 of Proposition 6 depends on $u$ only through the support of $u$. Therefore it can be used as a necessary condition to test whether a maximizer of $D_r$ can have a given support. Since this equation is linear in $v$ it is enough to check it for a basis of $\ker A$.

Remark 9. Condition 3 of Proposition 6 is also linear in $v$, since $d'_u(v) = v(u > 0)$ is linear in this case. Moreover, it is trivially satisfied for $v = u$. This means that it is enough to check condition 3 on a basis of any subspace $K\subseteq\ker A$ such that the span of $K$ and $u$ contains all $v\in\ker A$ with $\mathrm{supp}(v)\subseteq\mathrm{supp}(u)$. A possible choice is

$$K_u = \{v\in\ker A : \mathrm{supp}(v)\subseteq\mathrm{supp}(u)\text{ and } d'_u(v) = 0\}. \tag{31}$$

In this subspace, the equations of Proposition 6, 3. simplify to

$$\sum_{x} v(x)\log\frac{|u(x)|}{r(x)} = 0 \tag{32}$$

for all $v\in K_u$.
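
The equality in condition 3 of Proposition 6 can be tested numerically on a basis of the kernel. The sketch below is only an illustration under stated assumptions: the sufficient statistics are taken to be the single constant row $(1,1,1,1)$, so that $\ker A$ consists of all vectors summing to zero, and the reference measure `r`, the basis `kernel_basis` and the candidates `u_good`, `u_bad` are made up for this example.

```python
import math

def d_prime(u, v):
    """d'_u(v) = v(u > 0) + v_+(u = 0): one-sided derivative of the degree in direction v."""
    return (sum(vx for ux, vx in zip(u, v) if ux > 0)
            + sum(max(vx, 0.0) for ux, vx in zip(u, v) if ux == 0))

def D_r(u, r):
    return sum(ux * math.log(abs(ux) / rx) for ux, rx in zip(u, r) if ux != 0)

def first_order_residual(u, v, r):
    """Left minus right side of condition 3; requires supp(v) to lie inside supp(u)."""
    lhs = sum(vx * math.log(abs(ux) / rx) for ux, vx, rx in zip(u, v, r) if vx != 0)
    return lhs - d_prime(u, v) * D_r(u, r)

# Assumed setting: A = (1, 1, 1, 1), so ker A is the set of vectors summing to zero.
r = [1.0, 2.0, 1.0, 3.0]
kernel_basis = [(1, -1, 0, 0), (0, 0, 1, -1), (1, 0, -1, 0)]

u_good = [1/3, 2/3, -1/4, -3/4]     # proportional to r on each sign class
u_bad = [0.5, 0.5, -0.25, -0.75]    # same signs and degree, but not quasi-critical

residuals_good = [first_order_residual(u_good, v, r) for v in kernel_basis]
residuals_bad = [first_order_residual(u_bad, v, r) for v in kernel_basis]
```

All entries of `residuals_good` vanish up to rounding, whereas `residuals_bad` contains a clearly nonzero entry, so `u_bad` cannot be a local maximizer of $D_r$ in this setting.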
Figure 1: The binary independence model.



6 The codimension one case 



In this section the theory developed in the previous sections will be applied to the case 
where the exponential family has codimension one. 

Example 10. If $\ker A$ is one-dimensional, then it is spanned by a single vector $u = P_+ - P_-$, where $P_+$ and $P_-$ are two probability measures. If $H_r(P_+) = H_r(P_-)$, then both $P_+$ and $P_-$ are global maximizers of $D(\cdot\|\mathcal{E})$. Otherwise assume that $H_r(P_+) < H_r(P_-)$. Then $P_+$ is the global maximizer of $D(\cdot\|\mathcal{E})$. Note that $-u$ is another local maximizer of $D_r$. It is easy to see that $P_-$ is also a local maximizer of $D(\cdot\|\mathcal{E})$.

This example can serve as a source of examples and counterexamples. For example, it is easy to see that for a general exponential family, $\mathrm{supp}(P_{\mathcal{E}})$ can be an arbitrary set $\mathcal{Y}$ of cardinality greater than or equal to two: Just choose two measures $P_+$, $P_-$ of disjoint support such that $\mathrm{supp}(P_+)\cup\mathrm{supp}(P_-) = \mathcal{Y}$, let $u = P_+ - P_-$ and choose a matrix $A$ such that $\ker A$ is spanned by $u$. In the same way one can prove the following statements:

• Any set $\mathcal{Y}\subset\mathcal{X}$ with cardinality less than $|\mathcal{X}| - 1$ is the support of a global maximizer $P$ of $D(\cdot\|\mathcal{E})$ for some exponential family $\mathcal{E}$.

• Any measure supported on a set $\mathcal{Y}\subset\mathcal{X}$ with cardinality less than $|\mathcal{X}|$ is a local maximizer of $D(\cdot\|\mathcal{E})$ for some exponential family $\mathcal{E}$.

• Any measure supported on a set $\mathcal{Y}\subset\mathcal{X}$ with cardinality less than $|\mathcal{X}| - 1$ is a global maximizer of $D(\cdot\|\mathcal{E})$ for some exponential family $\mathcal{E}$.

Of course, these statements are no longer true when the reference measure is fixed or when the class of exponential families is restricted in any way.

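Example 11. As a special case of the previous example, consider the binary independence model with $\mathcal{X} = \{00, 01, 10, 11\}$,

$$A = \begin{pmatrix} 1 & 1 & 0 & 0\\ 0 & 0 & 1 & 1\\ 1 & 0 & 1 & 0\\ 0 & 1 & 0 & 1 \end{pmatrix} \tag{33}$$

(columns indexed by $00, 01, 10, 11$) and $r(x) = 1$ for all $x\in\mathcal{X}$. It is easy to see that $\mathcal{E}$ consists of all probability measures $P$ which factorize as $P(x_1x_2) = P_1(x_1)P_2(x_2)$, justifying the name of this model. The kernel is spanned by

$$u = (+1, -1, -1, +1). \tag{34}$$

By Example 10, with $P_+ = \frac12(\delta_{00} + \delta_{11})$ and $P_- = \frac12(\delta_{01} + \delta_{10})$ we have $H_r(P_+) = H_r(P_-)$, so both are global maximizers of $D(\cdot\|\mathcal{E})$ (see Figure 1).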
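
The binary independence model is small enough to verify by hand or with a few lines of code. The sketch below assumes that the sufficient statistics are the indicator functions of the two single-bit marginals (a $4\times 4$ matrix with columns ordered $00, 01, 10, 11$) and that $r\equiv 1$; under these assumptions the projection of any $P$ is the product of its marginals. All function names are illustrative.

```python
import math

# Assumed sufficient statistics: indicators of x1=0, x1=1, x2=0, x2=1.
A = [[1, 1, 0, 0],
     [0, 0, 1, 1],
     [1, 0, 1, 0],
     [0, 1, 0, 1]]

u = [0.5, -0.5, -0.5, 0.5]                       # kernel vector with degree d_u = 1
A_times_u = [sum(a * x for a, x in zip(row, u)) for row in A]

P_plus = [max(x, 0.0) for x in u]                # (1/2)(delta_00 + delta_11)
P_minus = [max(-x, 0.0) for x in u]              # (1/2)(delta_01 + delta_10)

def product_of_marginals(P):
    """Projection onto the independence model for r = 1: product of the two marginals."""
    first = [P[0] + P[1], P[2] + P[3]]           # marginal of x1
    second = [P[0] + P[2], P[1] + P[3]]          # marginal of x2
    return [first[i] * second[j] for i in range(2) for j in range(2)]

def kl(P, Q):
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

div_plus = kl(P_plus, product_of_marginals(P_plus))
div_minus = kl(P_minus, product_of_marginals(P_minus))
```

Both divergences equal $\log 2$, in agreement with the observation that $\exp(-D(P_+\|\mathcal{E})) + \exp(-D(P_-\|\mathcal{E})) = 1$ for a pair of projection points.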

7 Solving the critical equations 

Finding the maximizers of $D_r$ has some advantages over directly finding the maximizers of $D(\cdot\|\mathcal{E})$, mainly because of two reasons:

1. The dimension of the problem is reduced: Instead of maximizing over the whole probability simplex the maximization takes place over a convex subset of the kernel of the matrix $A$. Therefore the dimension of the problem is reduced by the dimension of the exponential family.

2. A projection on the exponential family is not needed: $D_r$ can be computed by a "simple" formula.

A numerical search for the maximizers using gradient search algorithms is now feasible for larger models. However, there may be a lot of local maximizers, so it is still a difficult problem to find the global maximizers of $D(\cdot\|\mathcal{E})$. Of course, the above ideas can also be used with symbolic calculations in order to investigate the maximizers of $D(\cdot\|\mathcal{E})$.

In the following assume that the sufficient statistics matrix $A$ has only integer entries. In this case $\ker A$ has a basis of integer vectors. An important class of examples where this condition is satisfied are hierarchical models.

Under these assumptions we turn to the equations of Proposition 6. The main observation is that equation (28) is algebraic for suitable $v$ once we fix the sign vector of $u$. This motivates looking separately at each possible sign vector $\sigma$ that occurs in $\ker A$.

Remark 12. Before investigating the critical equations some short remarks on the sign vectors are necessary. The set of possible sign vectors occurring in a vector space (in this case $\ker A$) forms a (realizable) oriented matroid. A sign vector $\sigma$ is called an (oriented) circuit if its support $\{x\in\mathcal{X} : \sigma_x\neq 0\}$ is inclusion minimal. See the first chapter of [5] for an introduction to oriented matroids.

Every sign vector can be written as a composition $\sigma_1\circ\dots\circ\sigma_n$ of circuits, where $\circ$ is the associative operation defined by

$$(\sigma\circ\tau)_x = \begin{cases}\sigma_x, & \text{if }\sigma_x\neq 0,\\ \tau_x, & \text{else.}\end{cases} \tag{35}$$

There is a free software package TOPCOM [16] which computes the signed circuits of a matrix. However, this package does not (yet) compute all the sign vectors, but this second step is easy to implement.

There is a second possible algorithm for computing all sign vectors of an oriented matroid, which shall only be sketched here, since it uses the complicated notion of duality (see [5] for the details): Namely, the set of all sign vectors is characterized by the so-called orthogonality property, meaning that the set of all sign vectors can be computed by calculating all cocircuits and checking the orthogonality property on each possible vector $\sigma\in\{0,\pm 1\}^{\mathcal{X}}$.

The nonzero sign vectors $\sigma$ occurring in a vector space always come in pairs $\pm\sigma$. It is customary to list only one representative of each such pair. This is not a problem, since the function $D_r$ is antisymmetric, i.e., a local maximizer $u$ with sign vector $\mathrm{sgn}(u) = -\sigma$ corresponds to a local minimizer $-u$ with sign vector $\sigma$, and both will be quasi-critical points of $D_r$.
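
The step from signed circuits to all sign vectors can be sketched by closing the circuits under composition, where the composition of two sign vectors keeps the entries of the first wherever they are nonzero and takes the entries of the second elsewhere. The code below is a naive illustration, not an interface to TOPCOM; the circuit list is the assumed example of the space of all vectors in $\mathbb{R}^3$ summing to zero.

```python
def compose(sigma, tau):
    """Composition of sign vectors: sigma where it is nonzero, tau elsewhere."""
    return tuple(s if s != 0 else t for s, t in zip(sigma, tau))

def sign_vectors_from_circuits(circuits):
    """Close the signed circuits (together with their negatives) under composition."""
    vectors = set(circuits) | {tuple(-s for s in c) for c in circuits}
    changed = True
    while changed:
        changed = False
        for a in list(vectors):
            for b in list(vectors):
                c = compose(a, b)
                if c not in vectors:
                    vectors.add(c)
                    changed = True
    return vectors

# Circuits of {u in R^3 : u_1 + u_2 + u_3 = 0}, one representative per pair.
circuits = [(1, -1, 0), (0, 1, -1), (1, 0, -1)]
vectors = sign_vectors_from_circuits(circuits)
```

For this example the closure consists of the twelve nonzero sign vectors that contain at least one positive and one negative entry. The quadratic loop is only meant for small ground sets.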

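Now fix a sign vector $\sigma$ and choose $u_0\in\ker A$ such that $\mathrm{sgn}(u_0) = \sigma$ and $d_{u_0} = 1$. Denote $\mathcal{Y} := \mathrm{supp}(\sigma) = \mathrm{supp}(u_0)$. Define $d_\sigma(v) := \sum_{x : \sigma_x > 0} v(x)$. This implies $d_\sigma(v) = d'_u(v)$ whenever $\mathrm{sgn}(u) = \sigma$ and $\mathrm{supp}(v)\subseteq\mathrm{supp}(u_0) = \mathrm{supp}(\sigma)$. Let

$$K^\sigma := \{v\in\ker A : d_\sigma(v) = 0\text{ and }\mathrm{supp}(v)\subseteq\mathrm{supp}(\sigma)\}. \tag{36}$$

If $u\in\ker A$ satisfies $d_u = 1$ and $\mathrm{sgn}(u) = \sigma$, then $u - u_0\in K^\sigma$. By definition $u$ is a quasi-critical point of $D_r$ if and only if

$$\sum_{x} v(x)\log\frac{|u(x)|}{r(x)} = 0\quad\text{for all } v\in K^\sigma \tag{37}$$

(see Remark 9). These equations are linear in $v$, so it is enough to consider them for a spanning set of $K^\sigma$. Since by assumption the matrix $A$ has only integer entries the set

$$K^\sigma_{\mathbb{Z}} := K^\sigma\cap\mathbb{Z}^{\mathcal{X}} \tag{38}$$

contains a spanning set of $K^\sigma$. Therefore $u$ is a quasi-critical point of $D_r$ if and only if

$$\sum_{x} v(x)\log\frac{|u(x)|}{r(x)} = 0\quad\text{for all } v\in K^\sigma_{\mathbb{Z}}. \tag{39}$$

Exponentiating these equations gives

$$\prod_{x\in\mathcal{Y} : v(x) > 0}\Big(\frac{|u(x)|}{r(x)}\Big)^{v(x)} = \prod_{x\in\mathcal{Y} : v(x) < 0}\Big(\frac{|u(x)|}{r(x)}\Big)^{-v(x)}\quad\text{for all } v\in K^\sigma_{\mathbb{Z}}. \tag{40}$$

This is a system of polynomial equations. Every solution $u\in u_0 + K^\sigma$ to this system that satisfies $\mathrm{sgn}(u) = \sigma$ is a quasi-critical point of $D_r$ and thus a potential maximizer.

At this point it is possible to do one more simplification: If $v\in K^\sigma_{\mathbb{Z}}$, then $v(\sigma < 0) = v(\sigma\neq 0) - v(\sigma > 0) = 0$. It follows that $v_+(\sigma < 0) - v_-(\sigma < 0) = 0$, so

$$\prod_{x : v(x) > 0}(\sigma_x)^{v(x)} = (-1)^{v_+(\sigma < 0)} = (-1)^{v_-(\sigma < 0)} = \prod_{x : v(x) < 0}(\sigma_x)^{-v(x)}. \tag{41}$$

All in all this yields:

Proposition 13. Fix a sign vector $\sigma\in\{0,\pm 1\}^{\mathcal{X}}$. Let $u\in\ker A$ satisfy $d_u = 1$ and

$$\prod_{x : v(x) > 0}\Big(\frac{u(x)}{r(x)}\Big)^{v(x)} = \prod_{x : v(x) < 0}\Big(\frac{u(x)}{r(x)}\Big)^{-v(x)} \tag{42}$$

for all $v = v_+ - v_-\in K^\sigma_{\mathbb{Z}}$. If $\mathrm{sgn}(u) = \sigma$, then $u$ is a quasi-critical point of $D_r$. Every quasi-critical point of $D_r$ arises in this way.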
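
Because the equations of Proposition 13 are polynomial with integer exponents, they can be checked exactly in rational arithmetic. The following sketch assumes the toy setting $A = (1,1,1,1)$ with sign vector $\sigma = (+,+,-,-)$, for which the integer vectors listed in `lattice_vectors` lie in $K^\sigma_{\mathbb{Z}}$; the reference measure `r` and the candidate `u` are made up for illustration.

```python
import math
from fractions import Fraction

def binomial_residuals(u, r, lattice_vectors):
    """Difference of both sides of the binomial equation for each integer vector v."""
    residuals = []
    for v in lattice_vectors:
        lhs = math.prod((u[x] / r[x]) ** v[x] for x in range(len(u)) if v[x] > 0)
        rhs = math.prod((u[x] / r[x]) ** (-v[x]) for x in range(len(u)) if v[x] < 0)
        residuals.append(lhs - rhs)
    return residuals

# Assumed toy setting: ker A = vectors summing to zero, sigma = (+, +, -, -).
r = [Fraction(1), Fraction(2), Fraction(1), Fraction(3)]
lattice_vectors = [(1, -1, 0, 0), (0, 0, 1, -1), (1, -1, 1, -1)]

u = [Fraction(1, 3), Fraction(2, 3), Fraction(-1, 4), Fraction(-3, 4)]
residuals = binomial_residuals(u, r, lattice_vectors)

u_other = [Fraction(1, 2), Fraction(1, 2), Fraction(-1, 4), Fraction(-3, 4)]
residuals_other = binomial_residuals(u_other, r, lattice_vectors)
```

All entries of `residuals` are exactly zero, so `u` solves the binomial system and has the prescribed sign vector, while `u_other` violates the first equation.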

Remark 14. Note that the system of equations (42) still contains infinitely many equations. The argument before equation (38) shows that a finite number of equations is enough. However, there are different possible choices for this finite set (at least a basis of $K^\sigma_{\mathbb{Z}}$ is needed), and the choice may have a large computational impact. This issue will be addressed below.

Proposition 13 shows that the maximizers of $D_r$ can be found by analyzing all the solutions to the algebraic systems of equations (42) for all different possible sign vectors $\sigma$. Since the analysis of systems of polynomials works best over the complex numbers, in the following these equations will be considered as complex equations in the variables $u(x)$. Of course, only real solutions with the right sign pattern will be candidate solutions of the original maximization problem.

From now on fix $\sigma$ again. Define $I_2^\sigma\subseteq\mathbb{C}[u(x) : x\in\mathcal{Y}]$ to be the ideal generated by all equations (42) in the polynomial ring $\mathbb{C}[u(x) : x\in\mathcal{Y}]$ with one variable for each $x\in\mathcal{Y}$. Similarly, let $I_1^\sigma\subseteq\mathbb{C}[u(x) : x\in\mathcal{Y}]$ be the ideal generated by the equations

$$\sum_{x\in\mathcal{Y}} A_{i,x}u(x) = 0\quad\text{for all } i. \tag{43}$$

Finally let $I^\sigma := I_1^\sigma + I_2^\sigma$. The set of all common complex solutions of all equations in $I^\sigma$ is an algebraic subvariety of $\mathbb{C}^{\mathcal{Y}}$ and will be denoted by $X^\sigma$.

Remark 15. Note that we omitted the equation $d_u - 1 = 0$ in the definition of the ideal. It is easy to see that we can ignore this condition at first, because every solution satisfying $\mathrm{sgn}(u) = \sigma$ has $d_u\neq 0$ and can thus be normalized to a solution with $d_u = 1$. In other words, the original problem is solved once all points on the variety $X^\sigma$ that satisfy the sign condition are known. The algebraic reason for this fact is that all the defining equations of $I^\sigma$ are homogeneous. This means that we can also replace $X^\sigma$ by the projective variety corresponding to $I^\sigma$, which is another interpretation of the fact that the normalization does not matter at this point.

Both ideals $I_1^\sigma$ and $I_2^\sigma$ taken for themselves are very nice: $I_1^\sigma$ corresponds to a system of linear equations, so it can be treated by the methods of linear algebra. On the other hand, $I_2^\sigma$ is a system of binomial equations, and there are a lot of theoretical results and fast algorithms for binomial equations [8], [11]. However, the sum of a linear ideal and a binomial ideal can be arbitrarily complicated. In fact, it is easy to see that any ideal can be reparameterized as a sum of a linear ideal and a binomial ideal: For example, a polynomial equation $\sum_i m_i = 0$, where the $m_i$ are arbitrary monomials, is equivalent to the system of equations

$$\sum_i z_i = 0,\qquad z_i - m_i = 0\quad\text{for all } i,$$

where one additional variable $z_i$ has been introduced for every monomial. Still, the two ideals $I_1^\sigma$ and $I_2^\sigma$ under consideration here are closely related, so there is hope that general statements can be made.

2 The mathematical disciplines of studying polynomial equations and their solution sets are commutative algebra and algebraic geometry. In the following some definitions from these two fields are used. The reader is referred to [6] for exact definitions and the basic facts.

X a equals the intersection of X° and X% , where Xf and X^ are the varieties of If 
and X% respectively. The variety X° is easy to determine: By definition it is given by 
the (complex) kernel of A restricted to y-. 

XI = ker c AnC y . (44) 

The variety X% is a little bit more complicated, but still a lot can be said. 

By definition, If is generated by a countable collection of binomials. In fact, Hilbert's 
Basissatz shows that a finite subset of the generators of 7f is sufficient to generate 
the ideal. In general it can be a difficult task to find such a finite subset, but since 
equations (1421) correspond to directional derivatives, it is sufficient to consider them for 
any basis B of K a (see Remark [T4"|) . So denote the ideal generated by the equations 
corresponding to a basis B of kerA by 12(B). In general 12(B) will have a different 
solution variety V (12(B)) than 1%, and moreover V (12(B)) will depend on B. From 
what was said above all these varieties agree on the orthant of ~R X defined by sgn = a. 
The presence of additional (complex) solutions outside this orthant may complicate the 
algebraic analysis. It is obvious that all the ideals h(B) are contained in 1% . This means 
that 1% has the smallest solution set, so a finite generating set of 1% would be useful. 

More precisely, since the ideal If is generated by binomials, the theory of [8] applies. 
Corollary 2.6 of this work implies that the ideal 1% is a prime ideal. This means that Xf 
is irreducible, i.e., it can not be written as a union of two proper subvarieties. Binomial 
prime ideals are also called toric ideals [HI remark before Corollary 2.6]. However, it is 
easy to construct examples such that 12(B) is not irreducible. 

Fortunately there are fast computer algorithms, implemented in the software package
4ti2 [1], which can be used to compute a finite generating set of $I_2^\sigma$ [10]. These algorithms
compute finite generating sets of so-called lattice ideals. It turns out that $I_2^\sigma$ becomes a
lattice ideal after a rescaling of the coordinates. To be concrete, writing $u_r(x) := \frac{u(x)}{r(x)}$
yields a new, equivalent ideal $I_{2,r}^\sigma \subseteq \mathbb{C}[u_r(x) : x \in \mathcal{Y}]$ generated by the binomials

$$\prod_{x : v(x) > 0} u_r(x)^{v(x)} = \prod_{x : v(x) < 0} u_r(x)^{-v(x)}, \qquad \text{for all } v \in K_{\mathbb{Z}}^\sigma. \tag{45}$$

The ideal $I_{2,r}^\sigma$ is called a lattice ideal, since it is related to the integer lattice $K_{\mathbb{Z}}^\sigma \subset \mathbb{Z}^{\mathcal{Y}}$.
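The passage from a lattice vector to its binomial can be made concrete with a few lines of Python. This is only a minimal sketch: the function names `lattice_binomial` and `evaluate_binomial` are hypothetical, and the vector `v` and the point `u` below are toy data, not taken from any of the models in this paper.

```python
def lattice_binomial(v):
    """Split an integer vector v into the two exponent vectors of its binomial.

    The binomial attached to v is  prod u(x)^{v_plus(x)} - prod u(x)^{v_minus(x)},
    where v_plus and v_minus are the positive and negative parts of v.
    """
    v_plus = [max(c, 0) for c in v]
    v_minus = [max(-c, 0) for c in v]
    return v_plus, v_minus


def evaluate_binomial(v, u):
    """Evaluate the binomial of v at the point u (a list of numbers)."""
    v_plus, v_minus = lattice_binomial(v)
    lhs, rhs = 1, 1
    for e_plus, e_minus, value in zip(v_plus, v_minus, u):
        lhs *= value ** e_plus
        rhs *= value ** e_minus
    return lhs - rhs


if __name__ == "__main__":
    v = [1, -1, -1, 1]        # toy lattice vector
    u = [2, 3, 4, 6]          # toy point with u[0]*u[3] == u[1]*u[2]
    print(lattice_binomial(v))
    print(evaluate_binomial(v, u))
```

A point lies in the variety of the lattice ideal exactly when `evaluate_binomial` vanishes for every lattice vector; in practice only a finite generating set needs to be checked.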
Now we turn to $X^\sigma = X_1^\sigma \cap X_2^\sigma$. Even though $X_1^\sigma$ and $X_2^\sigma$ are irreducible, in general
$X^\sigma$ will be reducible. This means that we can write $X^\sigma$ as a finite union of irreducible
components $X^\sigma = V_1^\sigma \cup \dots \cup V_c^\sigma$. To each of these components $V_i^\sigma$ corresponds a
polynomial ideal $I(V_i^\sigma)$, and we have $u \in X^\sigma$ if and only if $u$ solves (at least) one of these
ideals. The procedure to obtain the ideals $I(V_i^\sigma)$ is called primary decomposition.

If an irreducible component $V_i^\sigma$ is zero-dimensional, then it consists of only one point,
and it is easy to check whether this unique element $u \in V_i^\sigma$ satisfies $\operatorname{sgn}(u) = \sigma$. However,
components of positive dimension may arise. In this case it is not easy to see whether
these components contain elements $u$ satisfying $\operatorname{sgn}(u) = \sigma$. Fortunately, in many cases
this information is not required:

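Theorem 16. Let $u$ be an element of an irreducible component $V$ of $X^\sigma$ such that
$d_\sigma(u) = 1$. Suppose there exists $u_0 \in V$ such that $d_\sigma(u_0) = 1$ and $\operatorname{sgn}(u_0) = \sigma$. Then

$$D_r(u_0) = \sum_{x \in \mathcal{Y}} \operatorname{Re}(u(x)) \log\frac{|u(x)|}{r(x)}. \tag{46}$$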

Proof. Let

$$V' := \{ v \in V : v(x) \neq 0 \text{ for all } x \in \mathcal{Y}, \text{ and } d_\sigma(v) \neq 0 \}. \tag{47}$$

Then $V'$ is a Zariski-open subset of $V$, hence $V'$ is irreducible. This implies that $V'$
is pathconnected, so there exists a smooth path $\gamma : [0, 1] \to V'$ from $u$ to $u_0$. This is
obvious if $V'$ is regular, since then $V'$ is a locally pathconnected and connected complex
manifold. It follows that all regular points can be connected by a smooth path. Finally,
every singular point $p$ can be linked by a smooth path to some regular point in any
neighbourhood of $p$. By Remark 15 this path can be chosen such that $d_\sigma(\gamma_t) = 1$ for all
$t \in [0, 1]$.

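Fix a point $u \in V'$ and fix a convention for the logarithm. For every $x \in \mathcal{Y}$ the
logarithm can be continued to a map $t \mapsto \log^*(\gamma_t(x))$. For every $t \in [0, 1]$ define a linear
functional $s_t : K_{\mathbb{C}}^\sigma \to \mathbb{C}$ via

$$s_t(v) := \frac{1}{2\pi i} \sum_{x \in \mathcal{Y}} v(x) \log^*\frac{\sigma_x \gamma_t(x)}{r(x)}. \tag{48}$$

By definition of $X^\sigma$ it follows that $s_t$ takes only integer values on $K_{\mathbb{Z}}^\sigma$, and $s_t$ can be
identified with an element of the dual lattice $K_{\mathbb{Z}}^{\sigma *}$ of $K_{\mathbb{Z}}^\sigma$. Since $K_{\mathbb{Z}}^{\sigma *}$ is a discrete subset
of the dual vector space $K_{\mathbb{C}}^{\sigma *}$ and since the map $t \mapsto s_t$ is continuous, $s_t$ is constant along
$\gamma$.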

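Now consider the function $f(t) = \sum_{x \in \mathcal{Y}} \gamma_t(x) \log^*\frac{\sigma_x \gamma_t(x)}{r(x)}$. Its derivative is

$$f'(t) = \sum_{x \in \mathcal{Y}} \gamma'_t(x) \log^*\frac{\sigma_x \gamma_t(x)}{r(x)} = 2\pi i \, s_t(\gamma'_t),$$

where $\gamma'_t(x) = \frac{\partial}{\partial t} \gamma_t(x)$ and $\gamma'_t \in K_{\mathbb{C}}^\sigma$. The additional term $\sum_{x \in \mathcal{Y}} \gamma'_t(x)$ vanishes, since $\gamma_t \in \ker_{\mathbb{C}} A$ and the row span of $A$ contains the constant vector. It follows that
$f(1) - f(0) = 2\pi i \, s_0(\gamma_1 - \gamma_0)$. In other words,

$$\sum_{x \in \mathcal{Y}} u_0(x) \log^*\frac{\sigma_x u_0(x)}{r(x)} = \sum_{x \in \mathcal{Y}} u(x) \log^*\frac{\sigma_x u(x)}{r(x)} + 2\pi i \, s_0(u_0 - u). \tag{49}$$

Taking the real parts of this equation gives

$$\begin{aligned}
D_r(u_0) &= \operatorname{Re}(f(0)) + 2\pi s_0(\operatorname{Im}(u)) \\
&= \operatorname{Re}(f(0)) - i \sum_{x \in \mathcal{Y}} \operatorname{Im}(u(x)) \log^*\frac{\sigma_x u(x)}{r(x)} \\
&= \operatorname{Re}\Bigl( \sum_{x \in \mathcal{Y}} \operatorname{Re}(u(x)) \log^*\frac{\sigma_x u(x)}{r(x)} \Bigr) \\
&= \sum_{x \in \mathcal{Y}} \operatorname{Re}(u(x)) \log\frac{|u(x)|}{r(x)}.
\end{aligned} \tag{50}$$

By continuity this formula continues to hold on the closure of $V'$, which equals $V$. □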

The theorem implies that in many cases only one point $u$ from each irreducible component
of $X^\sigma$ needs to be tested. Only if $\sum_{x \in \mathcal{Y}} \operatorname{Re}(u(x)) \log\frac{|u(x)|}{r(x)}$ is exceptionally large
is it necessary to analyze this irreducible component further and see if there is a real
point $u_0$ from the same irreducible component that satisfies the sign condition.
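The test value on the right-hand side of (46) is easy to evaluate for a (possibly complex) sample point of a component. The following Python sketch does this; the function name `component_test_value` is hypothetical, the sample data are toy values, and the code does not check the hypothesis $d_\sigma(u) = 1$ of Theorem 16, which is assumed to hold for the input.

```python
import math


def component_test_value(u, r):
    """Return the sum over x of Re(u(x)) * log(|u(x)| / r(x)).

    u : list of real or complex numbers (a sample point of a component)
    r : list of positive reals (the reference measure)
    Entries with u(x) == 0 are skipped (convention 0 log 0 = 0).
    """
    total = 0.0
    for ux, rx in zip(u, r):
        if ux == 0:
            continue
        total += complex(ux).real * math.log(abs(ux) / rx)
    return total


if __name__ == "__main__":
    # Toy real point: the value is 2*log(2) because the entries -1 contribute log(1) = 0.
    print(component_test_value([2, -1, -1], [1, 1, 1]))
    # Toy complex point: only the real parts enter as weights.
    print(component_test_value([1 + 1j, -1 - 1j, 0], [1, 1, 1]))
```

Comparing this value across sample points of different components indicates which components deserve a closer look for real points with the correct sign vector.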

Remark 17. The above theorem also makes it possible to use methods of numerical algebraic
geometry [17]. These methods can determine the number of irreducible components
and their dimensions. Additionally it is possible to sample points from any irreducible
component. In fact, each component is represented by a so-called witness set, a set of
elements of this component. These points can then be used to numerically evaluate $D_r$.
One implementation, available on the Internet, is Bertinijl]. 

Let $x \in \mathcal{Y}$. For every irreducible component $V_i^\sigma$ there are the following alternatives:

• Either $u(x) = 0$ for all $u \in V_i^\sigma$. In this case $\operatorname{sgn}(u) \neq \sigma$ on $V_i^\sigma$.

• Or $u(x) = 0$ holds only on a subset of measure zero.

The reason for this is that the equation $u(x) = 0$ defines a closed subset of $V_i^\sigma$, and
either this closed subset is all of $V_i^\sigma$, or it has codimension one (this argument needs the
irreducibility of $V_i^\sigma$).

When computing the primary decomposition, the irreducible components of the first
kind can be excluded algebraically by a method called saturation: Namely, the variety
corresponding to the saturation

$$\Bigl( I^\sigma : \Bigl( \prod_{x \in \mathcal{Y}} u(x) \Bigr)^\infty \Bigr) := \bigl\{ f \in \mathbb{C}[u(x)] : f m \in I^\sigma \text{ for some monomial } m \in \mathbb{C}[u(x)] \bigr\} \tag{51}$$

consists only of those irreducible components of $X^\sigma$ which are not contained in any
coordinate plane. In the same way we may also saturate by the polynomial $d_\sigma(u)$,
since any solution $u$ with $\operatorname{sgn}(u) = \sigma$ will have $0 \neq d(u) = d_\sigma(u)$.

The main reason why saturation is important is that it may reduce the complexity of 
symbolic calculations. 
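The effect of saturation can be seen on a toy ideal in two variables $u_1, u_2$. This is a hypothetical example and is unrelated to the models studied here. Saturating by $u_1$ removes exactly the component contained in the coordinate plane $\{u_1 = 0\}$:

```latex
\begin{align*}
  I &= \langle u_1 (u_2 - 1) \rangle \subseteq \mathbb{C}[u_1, u_2],
  & V(I) &= \{u_1 = 0\} \cup \{u_2 = 1\}, \\
  (I : u_1^\infty) &= \langle u_2 - 1 \rangle,
  & V\bigl((I : u_1^\infty)\bigr) &= \{u_2 = 1\}.
\end{align*}
```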

Example 18. The above ideas can be applied to the hierarchical model (see [12]) of
pair interactions among four binary random variables (the "binary 4-2 model"). This
exponential family consists of all probability distributions of full support which factor
as a product of functions that depend on only two of the four random variables.

The maximization problem of this model is related to orthogonal Latin squares: If the
binary random variables are replaced by random variables of size $k$, then the maximizer
of the corresponding 4-2 model is easy to find if two orthogonal Latin squares of size $k$
exist [14], and in this case the maximum value of $D(\cdot\|\mathcal{E})$ equals $2\log(k)$. From this point
of view, the following discussion will give an extremely complicated proof of the trivial
fact that there are no two orthogonal Latin squares of size two.

The sufficient statistics may be chosen as 



$A_{4\text{-}2}$ = [matrix; entries not recoverable]

Here, the columns are ordered in such a way that column number $i + 1$ corresponds to
the state $x_i \in \mathcal{X} = \{0, 1\}^4$ that is indexed by the binary representation of $i \in \{0, \dots, 15\}$.

The software package TOPCOMPS] is used to calculate the oriented circuits of ker A, 
from which all sign vectors are computed by composition. Up to symmetry there are 73
different sign vectors occurring in $\ker A$. Here, the symmetry of the model is generated
by the permutations of the four binary units and the relabelings $0 \leftrightarrow 1$ of each unit.

From these 73 sign vectors only 20 satisfy condition 1 of Proposition 6. The sign
vectors of small support are easy to handle: There are two sign vectors $\sigma_1, \sigma_2$ whose
support has cardinality eight. They are in fact oriented circuits, which implies that, up
to normalization, there are two unique elements $u_1, u_2 \in \ker A$ such that $\operatorname{sgn}(u_i) = \sigma_i$,
$i = 1, 2$. They satisfy $D_r(u_i) = 0$, so they are surely not global maximizers.

There are three sign vectors whose support has cardinality twelve. Let $\sigma$ be one of
these. Then the restriction $\operatorname{supp}(u) \subseteq \operatorname{supp}(\sigma)$ selects a two-dimensional subspace of
$\ker A$, and it is easy to see that $D_r = 0$ on this subspace.

There remain 15 sign vectors that have full support. For every such sign vector $\sigma$
the system of algebraic equations in $I_1^\sigma$ and $I_2^\sigma$ has to be solved. To reduce the
number of equations and the number of variables one may parametrize the solution set
$\ker_{\mathbb{C}} A$ of $I_1^\sigma$ by finding a basis $u_1, \dots, u_5$ of $\ker A$. Then this parametrization is plugged
into the equations of $I_2^\sigma$. Some of these systems are at the limit of what today's desktop
computers can handle. Therefore care has to be taken in how these equations are formulated.
The general strategy is the following:

1. At first, compute a basis $v_1, \dots, v_4$ of $K_{\mathbb{Z}}^\sigma$ by using a Gram-Schmidt-like algorithm:
Renumber the $u_i$ such that $d_\sigma(u_5) \neq 0$ and let

$$v_i := \frac{d_\sigma(u_5)}{g} \, u_i - \frac{d_\sigma(u_i)}{g} \, u_5, \tag{52}$$

where $g = \gcd(d_\sigma(u_5), d_\sigma(u_i))$.

2. Let $I$ be the ideal in the variables $\lambda_1, \dots, \lambda_5$ generated by the equations

$$\prod_{x : v_i(x) > 0} u(x)^{v_i(x)} - \prod_{x : v_i(x) < 0} u(x)^{-v_i(x)}, \qquad \text{for all } i = 1, \dots, 4, \tag{53}$$

where $u(x) = \sum_{i=1}^{5} \lambda_i u_i(x)$.

3. Compute the saturation $J = \bigl(I : (\prod_{x \in \mathcal{X}} u(x))^\infty\bigr)$.

4. Compute the primary decomposition of $J$.

Note that the ideal $I$ in the second step corresponds to the ideal $I_2(B)$ defined above
for the basis $B = \{v_1, \dots, v_4\}$, where the variables $u(x)$ have been restricted to the
linear subspace $\ker_{\mathbb{C}} A$. The ideal $J$ obtained by saturation in the third step is then
independent of $B$.
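The first step of the strategy, equation (52), only needs integer arithmetic. The Python sketch below illustrates it under two assumptions that are not confirmed by the text above: $d_\sigma(u)$ is taken to be the sum of the entries of $u$ on the positive part of $\sigma$, and the pivot is simply the first basis vector with $d_\sigma \neq 0$ rather than a renumbered $u_5$. The function names and the toy data are hypothetical.

```python
from math import gcd


def d_sigma(u, sigma):
    """Assumed definition: sum of the entries of u where sigma is positive."""
    return sum(ux for ux, s in zip(u, sigma) if s > 0)


def tangent_lattice_vectors(basis, sigma):
    """Apply the combination of equation (52) to every non-pivot basis vector.

    Each returned vector v satisfies d_sigma(v) == 0 and has integer entries.
    """
    d = [d_sigma(u, sigma) for u in basis]
    pivot = next(k for k, dk in enumerate(d) if dk != 0)
    result = []
    for i, u in enumerate(basis):
        if i == pivot:
            continue
        g = gcd(d[pivot], d[i])          # gcd(a, 0) equals |a|
        a, b = d[pivot] // g, d[i] // g
        result.append([a * ui - b * up for ui, up in zip(u, basis[pivot])])
    return result


if __name__ == "__main__":
    sigma = [1, 1, -1, -1]                                   # toy sign vector
    basis = [[2, -1, -1, 0], [1, 1, -1, -1], [0, 1, 0, -1]]  # toy integer vectors
    for v in tangent_lattice_vectors(basis, sigma):
        print(v, d_sigma(v, sigma))
```

Dividing by the greatest common divisor keeps the entries of the $v_i$ small, which matters because the degree of the binomial (53) grows with the size of $v_i$.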

Unfortunately, this simple algorithm does not work for all sign vectors. Some further 
tricks are needed to compute the primary decomposition within a reasonable time. 
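A basis of $\ker A$ is given by the rows $u_1, \dots, u_5$ of the matrix

$$\begin{pmatrix}
1 & -1 & -1 & 1 & -1 & 1 & 1 & -1 & -1 & 1 & 1 & -1 & 1 & -1 & -1 & 1 \\
1 & 0 & -1 & 0 & -1 & 0 & 1 & 0 & -1 & 0 & 1 & 0 & 1 & 0 & -1 & 0 \\
1 & -1 & 0 & 0 & -1 & 1 & 0 & 0 & -1 & 1 & 0 & 0 & 1 & -1 & 0 & 0 \\
1 & -1 & -1 & 1 & 0 & 0 & 0 & 0 & -1 & 1 & 1 & -1 & 0 & 0 & 0 & 0 \\
1 & -1 & -1 & 1 & -1 & 1 & 1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}$$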

This basis has the following property: Let $u = \sum_{i=1}^{5} \lambda_i u_i$. If $\lambda_j = 0$ for some $j = 2, 3, 4, 5$,
then $D_r(u) = 0$. The reason is that if one $\lambda_j$ vanishes, then it is easy to see that
there is a bijection between the positive and negative entries of $u$ such that corresponding
entries have the same absolute value. This implies that, in order to determine the global
maximizer of this model, one may saturate $J$ by the product $\lambda_2 \lambda_3 \lambda_4 \lambda_5$.
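This vanishing property can be checked numerically. The sketch below rests on assumptions that go beyond the text: the basis is taken to be the parity vector on $\{0,1\}^4$ together with its restrictions to the states where one fixed bit is zero, the reference measure is assumed uniform (so it drops out because the entries of $u$ sum to zero), and $u$ is left unnormalized, which only rescales the value. All function names are hypothetical.

```python
import math
import random


def parity(i):
    """(-1) to the number of ones in the binary representation of i."""
    return -1 if bin(i).count("1") % 2 else 1


# Assumed basis: u_1 is the parity vector, u_{b+2} is parity restricted to bit b == 0.
U1 = [parity(i) for i in range(16)]
RESTRICTED = [[parity(i) if not (i >> b) & 1 else 0 for i in range(16)]
              for b in range(4)]
BASIS = [U1] + RESTRICTED


def combine(lams):
    """Linear combination sum_i lams[i] * u_i as a list of 16 numbers."""
    return [sum(l * u[x] for l, u in zip(lams, BASIS)) for x in range(16)]


def d_r_unnormalised(u):
    """Sum of u(x) log|u(x)|; a uniform reference measure cancels since sum(u) == 0."""
    return sum(ux * math.log(abs(ux)) for ux in u if ux != 0)


def values_with_one_lambda_zero(seed=0):
    """Evaluate the function for random coefficients with lambda_j = 0, j = 2..5."""
    rng = random.Random(seed)
    values = []
    for j in range(1, 5):
        lams = [rng.uniform(-2, 2) for _ in range(5)]
        lams[j] = 0.0
        values.append(d_r_unnormalised(combine(lams)))
    return values


if __name__ == "__main__":
    print(values_with_one_lambda_zero())            # all close to zero
    print(d_r_unnormalised(combine([1, 1, 1, 1, 1])))  # generic point: nonzero
```

In this assumed basis the pairing behind the property is the flip of the bit belonging to the vanishing coefficient, which changes the sign of every entry of $u$ and preserves its absolute value.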

Replacing $J$ by $(J : (\lambda_2 \lambda_3 \lambda_4 \lambda_5)^\infty)$ makes it possible to solve all but one system of
equations. For the last sign vector $\sigma$ a special measure is necessary: The complexity of
the above algorithm depends on the chosen basis $v_1, v_2, v_3, v_4$ of $K_{\mathbb{Z}}^\sigma$. The $\ell_1$-norm of each
vector $v_i$ equals twice the degree of the corresponding equation. Thus it is advisable to
choose the vectors $v_1, v_2, v_3, v_4$ as short as possible. As a first approximation, one may try
to use a basis of circuit vectors, i.e., vectors whose support is minimal. This approach
provides a basis $v_1, v_2, v_3, v_4$ for $K_{\mathbb{Z}}^\sigma$ of the last sign vector, such that the rest of the
algorithm sketched above works.

The calculations were performed with the help of Singular [9]. The primary decompositions
were done using the algorithm of Gianni, Trager and Zacharias (GTZ) implemented
in the library solve.lib. Analyzing the results yields the following theorem, confirming
a conjecture by Thomas Kahle (personal communication):

Theorem 19. The binary 4-2 model has, up to symmetry, a unique maximizer of the
information divergence, which is the uniform distribution over the states 0001, 0010,
0100, 1000 and 1111. The maximal value of $D_r$ is $\log 3 - \frac{1}{3}\log 5 \approx 0.56213298$; it is
reached at

$$u = \tfrac{1}{15}(-5, 3, 3, -1, 3, -1, -1, -1, 3, -1, -1, -1, -1, -1, -1, 3). \tag{54}$$

The maximum value of $D(\cdot\|\mathcal{E})$ is $\log(1 + 3 \cdot 5^{-1/3}) \approx 1.0132035$.
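The numbers in Theorem 19 can be cross-checked with a short computation. The sketch below makes three assumptions that should be read as such: the vector in (54) is normalized so that its positive entries sum to one (prefactor 1/15), the reference measure is uniform, and the two maximal values are related by $\log(1 + \exp(\cdot))$, which is used here only as a consistency check between the two quoted numbers. The function name `d_r` is hypothetical.

```python
import math

# Vector from (54), scaled so that the positive entries sum to one (assumption).
U = [c / 15 for c in (-5, 3, 3, -1, 3, -1, -1, -1, 3, -1, -1, -1, -1, -1, -1, 3)]
R = [1 / 16] * 16  # uniform reference measure (assumption)


def d_r(u, r):
    """Sum of u(x) log(|u(x)| / r(x)) over the support of u."""
    return sum(ux * math.log(abs(ux) / rx) for ux, rx in zip(u, r) if ux != 0)


if __name__ == "__main__":
    value = d_r(U, R)
    print(value, math.log(3) - math.log(5) / 3)
    print(math.log1p(math.exp(value)), math.log(1 + 3 * 5 ** (-1 / 3)))
```

Both printed pairs agree to the displayed precision, so the closed forms and decimal values in the theorem are mutually consistent under these assumptions.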



8 Computing the projection points 

The theory of this paper motivates a second method for computing the maximizers of
$D(\cdot\|\mathcal{E})$, which is more elementary than solving the critical equations. However, knowing
the critical equations sheds new light on this method.

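Let $P_+$ be a projection point and construct $P_-$ as in section 3. Then $u = P_+ - P_-$
and the common $rI$-projection $P_{\mathcal{E}}$ of $P_+$ and $P_-$ satisfy

$$u(x) = \begin{cases} \frac{1}{\mu} P_{\mathcal{E}}(x), & \text{if } x \in \mathcal{Z}, \\ -\frac{1}{1 - \mu} P_{\mathcal{E}}(x), & \text{if } x \notin \mathcal{Z}. \end{cases} \tag{55}$$

On the other hand, $P_{\mathcal{E}}$ lies in the closure of the exponential family. Suppose that $P_{\mathcal{E}}$
has full support. Then the exponential parameterization (1) implies that there exist
$a_1, \dots, a_h > 0$ such that

$$P_{\mathcal{E}}(x) = r(x) \prod_{i=1}^{h} a_i^{A_{i,x}}. \tag{56}$$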
Assume that $\sigma = \operatorname{sgn}(P_+ - P_-)$ has full support and define a $(h + 1) \times \mathcal{X}$-matrix $A^\sigma$
as follows: Take the matrix $A$ and add a zeroth row with entries

$$A^\sigma_{0,x} := \frac{1 - \sigma_x}{2} \in \{0, 1\}. \tag{57}$$

Then equations (55) and (56) together show that $u$ has the form

$$u(x) = r(x) \prod_{i=0}^{h} \alpha_i^{A^\sigma_{i,x}} \tag{58}$$

for suitably chosen $\alpha_i$. Here, $\alpha_0 = -\frac{\mu}{1 - \mu} < 0$, and all the other parameters are positive.
The normalization can be achieved since the row span of $A$ contains the constant
vector. Thus the projection points, which project into $\mathcal{E}$, can be found by plugging the
parameterization (58) into the equation $Au = 0$ and solving for the $\alpha_i$.

Again, this method simplifies if $A$ has only integer entries. Additionally it is convenient
to suppose that $A$ has only nonnegative entries. This nonnegativity requirement can
always be met, since $A$ contains the constant row in its row span. In this case the
parameterization (58) is monomial, so the equation $Au = 0$ is equivalent to $h$ polynomial
equations in the $h + 1$ parameters $\alpha_0, \dots, \alpha_h$.
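A toy instance shows how the monomial parameterization turns $Au = 0$ into polynomial conditions on the parameters. The sketch below uses the independence model of two binary variables as a hypothetical example that is not treated in this paper. The matrix `A`, the sign vector, the uniform reference weights and the parameter values are all illustrative assumptions, and the function names are hypothetical.

```python
from fractions import Fraction


def monomial_point(A, sigma, r, alpha):
    """Evaluate u(x) = r(x) * prod_i alpha_i ** A_sigma[i][x].

    A_sigma is A with an extra zeroth row (1 - sigma_x) / 2, so alpha[0]
    multiplies exactly the entries where sigma is negative.
    """
    zeroth = [(1 - s) // 2 for s in sigma]
    a_sigma = [zeroth] + A
    u = []
    for x in range(len(sigma)):
        value = Fraction(r[x])
        for i, row in enumerate(a_sigma):
            value *= Fraction(alpha[i]) ** row[x]
        u.append(value)
    return u


def apply_matrix(A, u):
    """Return the matrix-vector product A u."""
    return [sum(a * ux for a, ux in zip(row, u)) for row in A]


if __name__ == "__main__":
    # States 00, 01, 10, 11; rows: constant, first variable, second variable.
    A = [[1, 1, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1]]
    sigma = [1, -1, -1, 1]
    r = [1, 1, 1, 1]
    alpha = [-1, Fraction(1, 2), 1, 1]   # alpha_0 < 0, the others positive
    u = monomial_point(A, sigma, r, alpha)
    print(u)
    print(apply_matrix(A, u))            # all zero: u lies in ker A
```

For this toy model the parameter values above solve the three polynomial equations, and the resulting $u$ is the difference of the uniform distributions on $\{00, 11\}$ and $\{01, 10\}$. Exact rational arithmetic avoids any rounding in the check.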

This method is linked to the ideal $I_2^\sigma$ of the previous section. As stated there, $I_2^\sigma$ is
related to the lattice ideal $I_{2,r}^\sigma$, which defines a toric variety. Every toric variety has
a monomial "parameterization", which induces the monomial parameterization (58).
Unfortunately, in the general case this monomial parameterization is not surjective.
However, equation (58) shows that it is "surjective enough", at least in the case where
$\sigma$ has full support.

It is possible to extend this analysis to the case where $\sigma$ does not have full support.
Let $\mathcal{Y} = \operatorname{supp}(\sigma)$. First it is necessary to parameterize the set $\mathcal{E}_{\mathcal{Y}}$ of those probability
distributions of $\overline{\mathcal{E}}$ whose support is $\mathcal{Y}$. One solution is to find an element $r_{\mathcal{Y}} \in \overline{\mathcal{E}}$
such that $\operatorname{supp}(r_{\mathcal{Y}}) = \mathcal{Y}$. Then $\mathcal{E}_{\mathcal{Y}}$ equals the exponential family over the set $\mathcal{Y}$ with
reference measure $r_{\mathcal{Y}}$ whose sufficient statistics matrix $A_{\mathcal{Y}}$ consists of those columns of
$A$ corresponding to $\mathcal{Y} \subseteq \mathcal{X}$. This gives a monomial parameterization of $\mathcal{E}_{\mathcal{Y}}$ with at most
$h$ parameters.

The equations obtained from $Au = 0$ by plugging in a monomial parameterization for
$u$ can be solved by primary decomposition. Every solution $(\alpha_0, \dots, \alpha_h)$ yields a point
of $X^\sigma$. Theorem 16 applies in this context.

Example 20. The above ideas can be used to find the maximizers of the independence
model of three random variables of cardinalities 2, 3 and 3. This example is particularly
interesting, since the global maximizers are known for those independence models where
the cardinalities of the state spaces of the random variables satisfy an inequality [3]. The
cardinalities 2, 3 and 3 are the smallest set of cardinalities that violate this inequality.
A sufficient statistics of the model is given by

[matrix; entries not recoverable]

The states are numbered in the ternary representation from 000 to 122, where the
"highest" random variable only takes two values. The dimension of the model is $d = 5$
and the state space has cardinality 18. Thus $\dim \ker A = 18 - 5 - 1 = 12$. The symmetry
group of the model is generated by the permutation of the two random variables of 
cardinality three and by the permutations within the state space of each random variable. 

The cocircuits can be computed by TOPCOM. Testing all $3^{18}$ possible sign vectors of
length 18 shows that there are 182 796 non-zero sign vectors in $\ker A$ (up to symmetry).
Checking the support condition 1 leaves 975 sign vectors. Excluding all sign vectors
where the support of both the negative and the positive part exceeds 6 (cf. Theorem [B]) 
reduces the problem to 240 sign vectors. 

The 72 sign vectors that do not have full support can be treated as in the previous
section. For the 168 sign vectors that have full support the corresponding systems of
equations consist of $\dim \ker A - 1 = 11$ equations in $\dim \ker A = 12$ variables. These
are too difficult to solve in this way, but they can be treated using the method proposed
in this section, which "only" requires the primary decomposition of a system of $d = 5$
polynomials in $d + 1 = 6$ variables.

The analysis was carried out with the help of Singular. It proved to be advantageous
to use the algorithm of Shimoyama and Yokoyama (SY) from the library solve.lib.
The following result was obtained:

Theorem 21. The maximal value of $D(\cdot\|\mathcal{E})$ for the independence model of cardinalities
2, 3 and 3 equals $\log(3 + 2\sqrt{2}) \approx 1.7627472$, and the maximal value of $D_r$ is
$\log(2(1 + \sqrt{2})) \approx 1.5745208$. Up to symmetry there is a unique global maximizing probability
distribution

[display (60); entries not recoverable]

In order to compare the two methods of finding the maximizers of $D_r$ resp. $D(\cdot\|\mathcal{E})$
presented in this section and in the last section, let $d$ be the dimension of the model and
let $r = \dim \ker A$. All algorithms are most efficient if $A$ is chosen such that $h = d + 1$.
Then, for any sign vector $\sigma$ with full support, the algorithm of Example 18 starts with
$r - 1$ equations (corresponding to a basis of $K_{\mathbb{Z}}^\sigma$) in $r$ variables $\lambda_1, \dots, \lambda_r$, which are then
saturated. On the other hand, the method in this section starts with the $d + 1$ equations
$Au = 0$ in the $d + 2$ variables $\alpha_0, \dots, \alpha_{d+1}$. Thus, generically, the first method should
perform better when the codimension of the model is small, while the second method
should perform better when the dimension of the model is small.
9 Conclusions



In this work a new method for computing the maximizers of the information divergence
from an exponential family $\mathcal{E}$ has been presented. The original problem of maximizing
$D(\cdot\|\mathcal{E})$ over the set of all probability distributions is transformed into the maximization
of a function $D_r$ over $\ker A$, where $A$ is the sufficient statistics of $\mathcal{E}$. It has been shown
that the global maximizers of both problems are equivalent. Furthermore, every local
maximizer of $D(\cdot\|\mathcal{E})$ yields a maximizer of $D_r$. At present it is not known whether the
converse statement also holds.

The two main advantages of the reformulation are: 

1. A reduction of the dimension of the problem. 

2. The function D r can be computed by a formula. 

If $\mathcal{E}$ has codimension one, then the first advantage is most visible. Even this simple case
can be useful in order to obtain examples of maximizers having specific properties.

The maximizers of $D_r$ can be computed by solving the critical equations. These
equations are nice if they are considered separately for every sign vector $\sigma$ occurring in
$\ker A$. There are some conditions which allow one to exclude certain sign vectors from the
beginning. If the matrix $A$ contains only integer entries, then the critical equations are
algebraic, once the sign vector is fixed. In this case tools from commutative algebra can
be used to solve these equations.

A second possibility is to compute the points satisfying the projection property. If $A$
is an integer matrix and if the sign vector is fixed, then one obtains algebraic equations
which are related to the critical equations of $D_r$. This method is more appropriate for
exponential families of small dimension.

Of course, a problem with these two approaches is that every sign vector needs to be
treated separately, and their number grows quickly. By contrast, the problem of finding
the maximizers of $D(\cdot\|\mathcal{E})$ becomes a smooth problem if one restricts the support of the
possible maximizers. In general the set of possible support sets is much smaller than
the set of sign vectors. Still, two examples have been given where the maximizers were
not known before and where the separate analysis of each sign vector was feasible.